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Abstract 

By a "covering" wo mean a Gaussian mixture model fit to observed 
data. Approximations of the Bayes factor can be availed of to judge model 
fit to the data within a given Gaussian mixture model. Between families of 
^ Gaussian mixture models, we propose the Renyi quadratic entropy as an 

excellent and tractable model comparison framework. We exemplify this 
using the segmentation of an MRI imago volume, based (1) on a direct 
Gaussian mixture model applied to the marginal distribution function, 
and (2) Gaussian model fit through k-means applied to the 4D multivalued 
image volume furnished by the wavelet transform. Visual preference for 
one model over another is not immediate. The Renyi quadratic entropy 
allows us to show clearly that one of these modelings is superior to the 
other. 

Keywords: image segmentation; clustering; model selection; minimum descrip- 
tion length; Bayes factor, Renyi entropy. Shannon entropy 
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1 Introduction 



We begin with some terminology used. Segments are contiguous clusters. In 
an imaging context, this means that clusters contain adjacent or contiguous 
pixels. For typical 2D (two-dimensional) images, we may also consider the ID 
(one-dimensional) marginal which provides an empirical estimate of the pixel 
(probability) density function or PDF. For 3D (three-dimensional) images, we 
can consider 2D marginals, based on the voxels that constitute the 3D image 
volume, or also a ID overall marginal. An image is representative of a signal. 
More loosely a signal is just data, mostly here with neccessary sequence or 
adjacency relationships. Often we will use interchangeably the terms image, 
image volume if relevant, signal and data. 

The word "model" is used, in general, in many senses - statistical [TB] . 
mathematical, physical models; mixture model; linear model; noise model; neu- 
ral network model; sparse decomposition model; even, in different senses, data 
model. In practice, firstly and foremostly for algorithmic tractability, models 
of whatever persuasion tend to be homogeneous. In this article we wish to 
broaden the homogeneous mixture model framework in order to accommodate 
heterogeneity at least as relates to resolution scale. Our motivation is to have 
a rigorous model-based approach to data clustering or segmentation, that also 
and in addition encompasses resolution scale. 

In Figure[l]p' , the clustering task is portrayed in its full generality. One way 
to address it is to build up parametrized clusters, for example using a Gaussian 
mixture model (GMM), so that the cluster "pods" are approximated by the 
mixture made up of the cluster component "peas" (a viewpoint expressed by 
A.E. Raftery, quoted in [27J). 

A step beyond a pure "peas" in a "pod" approach to clustering is a hierar- 
chical approach. Application specialists often consider hierarchical algorithms 
as more versatile than their partitional counterparts (for example, k-means or 
Gaussian mixture models) since the latter tend to work well only on data sets 
having isotropic clusters [23]. So in [20], we segmented astronomical images of 
different observing filters, that had first been matched such that they related to 
exactly the same fields of view and pixel resolution. For the segmentation we 
used a Markov random field and Gaussian mixture model; followed by a within- 
segment GMM clustering on the marginals. Further within-cluster marginal 
clustering could be continued if desired. For model fit, we used approxima- 
tions of the Bayes factor: the pseudo-likelihood information criterion to start 
with, and for the marginal GMM work, a Bayesian information criterion. This 
hierarchical- or tree-based approach is rigorous and we do not need to go be- 
yond the Bayes factor model evaluation framework. The choice of segmentation 
methods used was due to the desire to use contiguity or adjacency information 
whenever possible, and when not possible to fall back on use of the marginal. 
This mixture of segmentation models is a first example of what we want to 
appraise in this work. 

What now if we cannot (or cannot conveniently) match the images before- 
hand? In that case, segments or clusters in one image will not necessarily 
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A Generic Machine-Assisted Discovery Problem: 
Data Mapping and a Searcli for Outliers 




Figure 1: Clusters of all morphologies are sought. Figure courtesy of George 
Djorgovski, Caltech. 
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correspond to corresponding pixels in another image. That is a second example 
of where we want to evaluate different families of models. 

A third example of what we want to cater for in this work is the use of wavelet 
transforms to substitute for spatial modeling (e.g. Markov random field model- 
ing). In this work one point of departure is a Gaussian mixture model (GMM) 
with model selection using the Bayes information criterion (BIG) approxima- 
tion to the Bayes factor. We extend this to a new hierarchical context. We use 
GMMs on resolution scales of a wavelet transform. The latter is used to provide 
resolution scale. Between resolution scales we do not seek a strict subset or 
embedding relationship over fitted Gaussians, but instead accept a lattice rela- 
tion. We focus in particular on the global quality of fit of this wavelet-transform 
based Gaussian modeling. We show that a suitable criterion of goodness of fit 
for cross-model family evaluations is given by Renyi quadratic entropy. 

1.1 Outline of the Article 

In section 2 we review briefly how modeling, with Gaussian mixture modeling 
in mind, is mapped into information. 

In section 3 we motivate Gaussian mixture modeling as a general clustering 
approach. 

In section 4 we introduce entropy and focus on the additivity property. This 
property is importatant to us in the following context. Since hierarchical cluster 
modeling, not well addressed or supported by Gaussian mixture modeling, is of 
practical importance we will seek to operationalize a wavelet transform approach 
to segmentation. The use of entropy in this context is discussed in section 5. 

The fundamental role of Shannon entropy together with some other defini- 
tions of entropy in signal and noise modeling is reviewed in section 6. Signal 
and noise modeling are potentially usable for image segmentation. 

For the role of entropy in image segmentation, section 2 presented the state 
of the art relative to Gaussian mixture modeling; and section 6 presented the 
state of the art relative to (segmentation-relevant) filtering. 

What if we have segmentations obtained through different modelings? Sec- 
tion 7 addresses this through the use of Renyi quadratic entropy. Finally, section 
8 presents a case study. 

2 Minimum Description Length and Bayes In- 
formation Criterion 

For what we may term a homogenous modeling framework, the minimum de- 
scription length, MDL, associated with Shannon entropy ,25], will serve us well. 
However as we will now describe it does not cater for hierarchically embedded 
segments or clusters. An example of where hierarchical embedding, or nested 
clusters, come into play can be found in ^20J . 

Following Hansen and Yu [T2], we consider a model class, 0, and an instanti- 
ation of this involving parameters 9 to be estimated, yielding 9. We have 6* e M*^ 
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so the parameter space is fc-dimensional. Our observation vectors, of dimension 
m, and of cardinality n, are defined as: X = {xi\l < i < n}. A model, M, is 
defined as f{X\e),0 e 6 C M'^',X = {x^\l <i<n},x,e C K™. The 

maximum likelihood estimator (MLE) of is 6*: 9 — argmaxg/(X|0). 

Using Shannon information, the description length of X based on a set of 
parameter values 9 is: — log f{X\9). We need to transmit parameters also (as, 
for example, in vector quantization). So overall code length is: — log f{X\9) + 
L{9). If the number of parameters is always the same, then L{9) can be constant. 
Minimizing — log/(X|6') over 9 is the same as maximizing f{X\9), so if L{9) 
is constant, then MDL (minimum description length) is identical to maximum 
likelihood, ML. 

The MDL information content of the ML, or equally Baycsian maximum a 
posteriori (MAP) estimate, is the code length of — log/(X|6') + L{9). First, we 
need to encode the k coordinates of 9, where k is the (Euclidean) dimension 
of the parameter space. Using the uniform encoder for each dimension, the 
precision of coding is then Ij \fn implying that the magnitude of the estimation 
error is 1 j \fn. So the price to be paid for communicating is fc ■ (— log 1 / y^) = 
I log n nats [12] . Going beyond the uniform coder is also possible with the same 
outcome. 

In summary, MDL with simple suppositions here (in other circumstances we 
could require more than two stages, and consider other coders) is the sum of 
code lengths for (i) encoding data using a given model; and (ii) transmitting 
the choice of model. The outcome is minimal — \ogj(X\9) + | logn. 

In the Bayesian approach we assign a prior to each model class, and then we 
use the overall posterior to select the best model. Schwarz's Bayesian Informa- 
tion Criterion (BIC), which approximates the Bayes factor of posterior ratios, 
takes the form of the same penalized likelihood, — log j(X\9) + | logn, where 
9 = ML or MAP estimate of 9. See [7] for case studies using BIC. 

3 Segmentation of Arbitrary Signal through a 
Gaussian Mixture Model 

Notwithstanding the fact that often signal is not Gaussian, cf. the illustra- 
tion of Figure [l] we can fit observational data - density / with support in to- 
dimensional real space, M™ - by Gaussians. Consider the case of heavy tailed 
distributions. 

Heavy tailed probability distributions, examples of which include long mem- 
ory or 1// processes (appropriate for financial time series, telecommunications 
traffic flows, etc.) can be modeled as a generalized Gaussian distribution (CCD, 
also known as power exponential, a-Gaussian distribution, or generalized Lapla- 
cian distribution): 
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where 

- scale parameter, a, represents the standard deviation, 

- the gamma function, T(a) — x°'~^e~^dx, and 

- shape parameter, (j, is the rate of exponential decay, /3 > 0. 

A value of /? = 2 gives us a Gaussian distribution. A value of /3 = 1 gives a 
double exponential or Laplace distribution. For < /3 < 2, the distribution is 
heavy tailed. For /9 > 2, the distribution is light tailed. 

Heavy tailed noise can be modeled by a Gaussian mixture model with enough 
terms [1]. Similarly, in speech and audio processing, low-probability and large- 
valued noise events can be modeled as Gaussian components in the tail of the 
distribution. A fit of this fat tail distribution by a Gaussian mixture model is 
commonly carried out As in Wang and Zhao [3T], one can allow Gaussian 
component PDFs to recombine to provide the clusters which are sought. These 
authors also found that using priors with heavy tails, rather than using standard 
Gaussian priors, gave more robust results. But the benefit appears to be very 
small. 

Gaussian mixture modeling of heavy tailed noise distributions, e.g. genuine 
signal and flicker or pink noise constituting a heavy tail in the density, is there- 
fore feasible. A solution is provided by a weighted sum of Gaussian densities 
often with decreasing weights corresponding to increasing variances. Mixing 
proportions for small (tight) variance components are large (e.g., 0.15 to 0.3) 
whereas very large variance components have small mixing proportions. 

Figures[2]and[3]illustrate long-tailed behavior and show how marginal density 
Gaussian model fitting works in practice. The ordinates give frequencies. See 
further discussion in [19l [18] . 

4 Additive Entropy 

Background on entropy can be found e.g. in [24]. Following Hartley's 1928 
treatment of equiprobable events. Shannon in 1948 developed his theory around 
expectation. In 1960 Renyi developed a recursive rather than linear estimation. 
Various other forms of entropy are discussed in [B] . 
Consider density / with support in M™. Then: 

• Shannon entropy: Hs = — J f{x)log f{x)dx 

• Renyi entropy: Hj^^ — log J f{x)"dx for a > 0, a 7^ 1. 

We have: linia tiHua = Hs- So Hri — Hs- We also have: Hrp > Hs > 

H]^^ for < /3 < 1 and 1 < 7 (see e.g. |3J, section 3.3). When a 2, i7;^2 is 
quadratic entropy. 

Both Shannon and Renyi quadratic entropy are additive, a property which 
will be availed of by us below for example when we we define entropy for a linear 
transform, i.e. an additive, invertible decomposition. 

To show this, let us consider a system decomposed into independent events, 
A, B. 
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Figure 2: Upper left: long-tailed histogram of marginal density of product of 
wavelet scales 4 and 5 of a 512 x 512 Lena image. Upper right, lower left, and 
lower right: histograms of classes 1, 2 and 3. These panels exemplify a nested 
model. ^ 
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Figure 3: Overplotting of the histograms presented in Figure [2| This shows 
how the classes reconstitute the original data. The histogram of the latter is 
the upper left one in Figure [2] 



So p{AB) (alternatively written: p{AkB) orp{A + B)) =p{A)p{B). Shan- 
non information is then Ig^ = — logp{AB) = — logp{A) — logp{B), so the in- 
formation of independent events is additive. Multiplying across by p{AB), and 
taking p{AB) = p{A) when considering only event A and similarly for B, leads 
to additivity of Shannon entropy for independent events, Hg^ = Hg + Hg . 

Similarly for Renyi quadratic entropy we use p{AB) = p{A)p{B) and we 
have: -logp^iAB) = -2logp{AB) = -2 log (p(A)p(B)) = -2logp{A) - 
2\ogp{B) = -logp2(A) _ \ogp\B). 

5 The Entropy of a Wavelet Transformed Signal 

The wavelet transform is a resolution-based decomposition - hence with an 
in-built spatial model: see e.g. [50] . 

A redundant wavelet transform is most appropriate, even if decimated alter- 
natives can be considered straightforwardly too. This is because segmentation, 
taking information into account at all available resolution scales, simply needs 
all available information. A non-redundant (decimated, e.g., pyramidal) wavelet 
transform is most appropriate for compression objectives, but it can destroy 
through aliasing potentially important faint features. 

If / is the original signal, or images, then the following family of redundant 
wavelet transforms includes various discrete transforms such as the isotropic, 
B3 spline, a trous transform, called the starlet wavelet transform in |5D] . 

s 

f ^^Ws + Ws+l (1) 
s=l 

where: ws+i is the smooth continuum, not therefore wavelet coefficients; Ws are 
wavelet coefficients at scale s. Dimensions of f ^Wg^ws+i are all identical. 

Nothing prevents us having a redundant Haar or, mutatis mutandis, redun- 
dant biorthogonal 9/7 wavelet transform (used in the JPEG-2000 compression 
standard). As mentined above, our choice of starlet transform is due to no 
damage being done, through decimation, to faint features in the image. As 
a matched filter the starlet wavelet function is appropriate for many types of 
biological, astronomical and other images [3D]. 

Define the entropy, H , of the wavelet transformed signal as the sum of the 
entropies H" at the wavelet resolution levels, s: 

H = J2h' (2) 

s = l 

Shannon and quadratic Renyi entropies are additive, as noted in section 
4. For additivity, independence of the summed components is required. A 
redundant transform does not guarantee independence of resolution scales, s = 
1,2,..., 5. However in practice we usually have approximate independence. 
Our argument in favor of bypassing indepence of resolution scales is based on 
the practical and interpretation-related benefits of doing so. 
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Next we will review the Shannon entropy used in this context. Then we 
will introduce a new application of the Rcnyi quadratic entropy, again in this 
wavelet transform context. 



6 Entropy Based on a Wavelet Transform and a 
Noise Model 

Image filtering allows, as a special case, thresholding and reading off segmented 
regions. Such approaches have been used for very fast - indeed one could say 
with justice, turbo-charged - clustering. See PTI [22] . 

Noise models are particularly important in the physical sciences (cf. CCD, 
charge-coupled device, detectors) and the following approach was developed in 
[25] . Observed data / in the physical sciences are generally corrupted by noise, 
which is often additive and which follows in many cases a Gaussian distribution, 
a Poisson distribution, or a combination of both. Other noise models may also 
be considered. Using B ayes' theorem to evaluate the probability distribution of 
the realization of the original signal g, knowing the data /, we have 



P(.l/) - "^^^^^ (3) 
P(/) 

p(/|f/) is the conditional probability distribution of getting the data / given an 
original signal g, i.e. it represents the distribution of the noise. It is given, in 
the case of uncorrelated Gaussian noise with variance cr^, by: 




p(/|g) = exp - > '-hr^\ (4) 



The denominator in equation ([3| is independent of g and is considered as a 
constant (stationary noise). p(.g) is the a priori distribution of the solution 
g. In the absence of any information on the solution g except its positivity, a 
possible course of action is to derive the probability of g from its entropy, which 
is defined from information theory. 

If we know the entropy H of the solution (we describe below different ways 
to calculate it), we derive its probability by 

p(g) = exp(-ai/(g)) (5) 

Given the data, the most probable image is obtained by maximizing p((?|/). 
This leads to algorithms for noise filtering and to deconvolution (3^. 

We need a probability density p{g) of the data. The Shannon entropy, Hs 
|26j . is the summing of the following for each pixel, 

Hs{g) ^ "^Pk log Pk (6) 

fe=l 
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where X — {gi, ..(7„} is an image containing integer values, iVf, is the number of 

possible values of a given pixel gk (256 for an 8-bit image), and the pk values 

are derived from the histogram of g as pk — where ruk is the number of 

occurrences in the histogram's kth bin. 

The trouble with this approach is that, because the number of occurrences 

_ 1 

is finite, the estimate pk will be in error by an amount proportional to nif. ^ 
[S]. The error becomes significant when is small. Furthermore this kind of 
entropy definition is not easy to use for signal restoration, because its gradient 
is not easy to compute. For these reasons, other entropy functions are generally 
used, including: 



Burg P]: 



Frieden [5]: 



Hsig) = -J2Hgk) (7) 



k = l 



Hp{g) = -J29kH9k) (8) 



k=l 



• Gull and SkiUing llj: 

Hg{9) = J2 9^ ~Mk-gk In (9) 

where M is a given model, usually taken as a fiat image 

In all definitions n is the number of pixels, and k represents an index pixel. 
For the three entropies above, unlike Shannon's entropy, a signal has maximum 
information value when it is flat. The sign has been inverted (see equation (|5])), 
to arrange for the best solution to be the smoothest. 

Now consider the entropy of a signal as the sum of the information at each 
scale of its wavelet transform, and the information of a wavelet coefficient is 
related to the probability of it being due to noise. Let us look at how this 
definition holds up in practice. Denoting h the information relative to a single 
wavelet coefficient, we define 

]=i k=i 

with the information of a wavelet coefficient, h(wj^k) = ^^'<^p{'Wj^k), (Burg's 
definition rather than Shannon's). I is the number of scales, and rij is the 
number of samples in wavelet band (scale) j. For Gaussian noise, and recalling 
that wavelet coefficients at a given resolution scale are of zero mean, we get 

2 
W- 

h{wj,k) = Tpy + constant (11) 
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where aj is the noise at scale j. When we use the information in a functional to 
be minimized (for filtering, deconvolution, thresholding, etc.), the constant term 
has no effect and we can omit it. We see that the information is proportional 
to the energy of the wavelet coefficients. The larger the value of a normalized 
wavelet coefficient, then the lower will be its probability of being noise, and the 
higher will be the information furnished by this wavelet coefficient. 
In summary, 

• Entropy is closely related to energy, and as shown can be reduced to it, in 
the Gaussian context. 

• Using probability of wavelet coefficients is a very good way of addressing 
noise, but less good for non-trivial signal. 

• Entropy has been extended to take account of resolution scale. 

In this section we have been concerned with the following view of the data: 
/ = g + a + e where observed data / is comprised of original data 5, plus 
(possibly) background a (flat signal, or stationary noise component), plus noise 
e. The problem of discovering signal from noisy, observed data is important and 
highly relevant in practice but it has taken us some way from our goal of cluster 
or segmentation modeling of / - which could well have been cleaned and hence 
approximate well g prior to our analysis. 

An additional reason for discussing the work reported on in this section is 
the common processing platform provided by entropy. 

Often the entropy provides the optimization criterion used (see p^O j [24 | 129]. 
and many other works besides). In keeping with entropy as having a key role 
in a common processing platform we instead want to use entropy for cross- 
model selection. Note that it complements other criteria used, e.g. ML, least 
squares, etc. We turn now towards a new way to define entropy for application 
across families of GMM analysis, wavelet transform based approaches, and other 
approaches besides, all with the aim of furnishing alternative segmentations. 

7 Model-Based Renyi Quadratic Entropy 

Consider a mixture model: 

k k 

fix) = "*/»(^) ^it^ E = 1 (12) 

1=1 1=1 

Here / could correspond to one level of a mutiple resolution transformed signal. 
The number of mixture components is k. 
Now take fi as Gaussian: 

Mx)^Mx\fi,V) = (^{2n)-^\V,\-'-)cxp(-^{x-fi,)V-\x-fi,Y] (13) 
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where x, e M™, e M'"^™. 
Take 

- afl (14) 

(/ = identity) for simplicity of the basic components used in the model. 

A (i) parsimonious (ii) covering of these basic components can use a BIG 
approximation to the Bayes factor (see section 2) for selection of model, k. 

Each of the functions fi comprising the new basis for the observed density / 
can be termed a radial basis |15j . A radial basis network, in this context, is an 
iterative EM-like fit optimization algorithm. An alternative view of parsimony 
is the view of a sparse basis, and model fitting is sparsification. This theme of 
sparse or compressed sampling is pursued in some depth in |30j . 

We have: 

/oo 
a2Mx\lii,Vi) . ajfj{x\iij,Vj) dx (15) 
-OO 

= aitti/ij^Mi - Mjj + Vj) (16) 

See [53] or [TU]. Consider the case - apropriate for us - of only distinct clusters 
so that summing over i,j we get: 

^^(1 - 5,j)a^ajf,j{^ii - fij,Vi + Vj) (17) 



Hence 



/oo 
f{xY dx (18) 
-oo 



can be written as: 

/oo 
aifi{x\^ii,V.i) . ajfj{x\fij,Vj) dx (19) 
-OO 

= - log ^ - 5ij)a,ajf,j{^ii - /ij, + Vj) (20) 

i 3 

= - losE E(l - 5^3)Ml^^ - 2^2/) (21) 



from restriction (14 1 and also restricting the weights, ai^aj = l,Vi ^ j. The 
term we have obtained expresses interactions beween pairs. Function is a 
Gaussian. There are evident links here with Parzen kernels [U [T3] and clustering 
through mode detection (see e.g. [1^, and [IT] and references therein). 



For segmentation we will simplify further expression ( 20 1 to take into account 
just the equiweighted segments reduced to their mean (cf. |4j). 

In line with how we defined mutiple resolution entropy in section 6, we 
can also define the Renyi quadratic information of wavelet transformed data as 
follows: 

s 

Hr2 = E -^«2 (22) 
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8 Case Study 



8.1 Data Analysis System 

In this work, we have used MRI (magnetic resonance imaging) and PET (positron 
emission tomography) image data volumes, and (in a separate study) a data vol- 
ume of galaxy positions derived from 3D cosmology data. A 3D starlet or B3 
spline a trous wavelet transform is used with these 3D data volumes. Figures 
|4] and |5] illustrate the system that we built. For 3D data volumes, we support 
the following formats: FITS, ANALYZE (.img, .hdr), coordinate data (x, y, z), 
and DICOM; together with AVI video format. For segmentation, we cater for 
marginal Gaussian mixture modeling, of a 3D image volume. For multivalued 
3D image volumes (hence 4D hypervolumes) we used Gaussian mixture model- 
ing restricted to identity variances, and zero covariances, i.e. k-means. Based 
on a marginal Gaussian mixture model, BIG is used. Renyi quadratic entropy 
is also supported. A wide range of options are available for presentation and 
display (traversing frames, saving to video, vantage point XY or YZ or XZ, 
zooming up to 800%, histogram equalization by frame or image volume). The 
software, MR3D version 2, is available for download at www.multiresolution.tv. 
The wavelet functionality requires a license to be activated, and currently the 
code has been written for PGs running Microsoft Windows only. 

8.2 Segmentation Algorithms Used 

Gonsider Tl, an aggregated representative brain, derived from MRI data. It 
is of dimensions 91 x 109 x 91. See Figure [4] In the work described here as 
image format for the 3D or 4D image volumes we used the FITS, Flexible Image 
Transport System, format. 

The first model-based segmentation was carried out as follows. 

• We use "Marginal Range" in the "Segmentation" pull-down menu to de- 
cide, from the plot produced, that the BIG criterion suggests that a 6 
cluster solution is best. 

• Then we use "Marginal" with 6 clusters requested, again in the "Segmen- 
tation" pull-down menu. Save the output as: Tl_segm_marg6.fits. 

Next an alternative model-based segmentation is carried out in wavelet 
space. 

• Investigate segmentation in wavelet space. First carry out a wavelet trans- 
form. The B3 spline a trous wavelet transform is used with 4 levels (i.e. 
3 wavelet resolution scales). The output produced is in files: Tl_l.fits, 
Tl_2.fits, Tl_3.fits, Tl_4.fits. 

• Use the wavelet resolution scales as input to "K-Means" , in the "Seg- 
mentation" pull-down menu. Specify 6 clusters. We used 6 clusters 
because of the evidence suggested by BIG in the former modeling, and 
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hence for comparability between the two modeUngs. Save the output as: 
T 1 _segm JcmeanS .fits. 

8.3 Evaluation of Two Segmentations 

We have two segmentations. The first is a segmentation found from the voxel's 
marginal distribution function. The second outcome is a segmentation found 
from the multivalued 3D (hence 4D) wavelet transform. 

Now we will assess Tl_segm_marg6 versus Tl^egm_kmean6. If we use BIG, 
using the Tl image and first one and then the second of these segmented images, 
we find essentially the same BIG value. (The BIG values of the two segmenta- 
tions differ in about the 12th decimal place.) Note though that the model used 
by BIG is the same as that used for the marginal segmentation; but it is not 
the same as that used for k-means. Therefore it is not fair to use BIG to assess 
across models, as opposed to its use within a family of the same model. 

Using Rcnyi quadratic entropy, in the "Segmentation" pull-down menu, we 
find 4.4671 for the marginal result, and 1.7559 for the k-means result. 

Given that parsimony is associated with small entropy here, this result points 
to the benefits of segmentation in the wavelet domain, i.e. the second of our two 
modeling outcomes. 

9 Conclusions 

We have shown that Renyi quadratic entropy provides an effective way to com- 
pare model families. It bypasses the limits of intra- family comparison, such as 
is offered by BIG. 

We have offered some preliminary experimental evidence too that direct 
unsupervised classification in wavelet transform space can be more effective 
than model-based clustering of derived data. Intended by the latter ("derived 
data") are marginal distributions. 

Our innovative results are very efficient from computational and storage 
viewpoints. The wavelet transform for a fixed number of resolution scales is 
computationally linear in the cardinality of the input voxel set. The pairwise 
interaction terms feeding the Renyi quadratic entropy are also efficient. For 
both of these aspects of our work, iterative or other optimization is not called 
for. 
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